Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes Indexed assignment of extra columns in SolutionArray #838

Merged
merged 15 commits into from
Sep 1, 2020

Conversation

sin-ha
Copy link
Contributor

@sin-ha sin-ha commented Apr 1, 2020

Checklist

  • There is a clear use-case for this code change
  • The commit message has a short title & references relevant issues
  • Build passes (scons build & scons test) and unit tests address code coverage
  • The pull request is ready for review

If applicable, fill in the issue number this pull request is fixing

Fixes #829

Changes proposed in this pull request

-Earlier the self._extra[name] was initialised as a list but when the set item method was called it returned a np.array() copy of it
-So I propose to initialise the init method with the np.array and amended other methods to abide with it
-I think that this would be a better implementation.

  • This solves item assignment as the original object gets passed instead of the copy

@@ -523,9 +523,9 @@ def __init__(self, phase, shape=(0,), states=None, extra=None):
"Unable to create extra column '{}': name is already "
"used by SolutionArray objects.".format(name))
if not np.shape(v):
self._extra[name] = [v]*self._shape[0]
self._extra[name] = np.array([v]*self._shape[0])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is a pre-existing choice, but It may make sense to allow for more dimensions? I.e. this would be a case where we have something like

   >>> states = ct.SolutionArray(gas, (6, 10), extra=[‘foo’, ‘bar’])

Copy link
Contributor Author

@sin-ha sin-ha Apr 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't clearly understand what you want to say. I guess you want to say foo and bar to be extra arrays of size (6,10)?
like

>>> states = ct.SolutionArray(gas, (6,10), extra={'foo':sd,"bar":sd})

where sd is any (6,10) list/np array
??

Copy link
Member

@ischoegl ischoegl Apr 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sin-ha ... what I wrote was deliberate - no typo there. It currently throws a ValueError, but you could argue that it should initialize np.ndarray's with the full _shape, not just the first dimension. From my perspective, the shape of the _extra entries should be consistent with other data.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ischoegl I'm also not sure what you're suggesting, but it seems like it would be a new feature? In that case, I think it seems outside the scope of this fix.

Copy link
Contributor Author

@sin-ha sin-ha Apr 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ischoegl As of now extra expects dicts unless the shape of SolutionArray is trivial

  • If the user passes a list we should be able to initiate the extra array with same shape.
  • else the user will pass that as a dict , in that case we should validate extra with the complete SolutionArray shape instead of self._shape[0]

Copy link
Contributor Author

@sin-ha sin-ha Apr 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also do think this is more of adding another functionality, seems a bit off of the issue so should it be pulled here or open another issue?

Copy link
Member

@ischoegl ischoegl Apr 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the point I raised is a trivial addition, and the application was stated in my first post: it’s a reasonable assumption that the arrays are instantiated with correct dimensions based on the information already provided. My guess is that the former single-dimensioned approach was based on _extra formerly being list-based. Now that this is no longer the case, putting in a more consistent approach is trivial.

What I am talking about affects some of the same lines that are changed already and won’t take more than a couple of additional lines. Hence, I’d like to see it considered as a friendly amendment of the original issue 😉

if isinstance(extra,str):
extra = [extra]

if isinstance(extra, list) and all(isinstance(name, str) for name in extra):
Copy link
Member

@ischoegl ischoegl Apr 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's captures almost everything! However, extra could also be an instance of tuple or ndarray of strings? I believe all of these are supported for when no shape is used for the instantiation of SolutionArray (i.e. I am talking about the former elif extra and self._shape == (0,): case, in which case any of the above can be iterated over).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that'd be a quick fix but I don't think np.arrays are important?.
Besides I believe the other part should be removed where self._shape is trivial as after this commit it would only be a desperate attempt to map something in extra, probably not useful (is there a particular use case?). it lacks checks and something weird like integers, for instance can pass which can't be used back.e.g.

>>>>states3 = ct.SolutionArray(gas, shape=(0,0,0), extra=[1])

Copy link
Member

@ischoegl ischoegl Apr 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don’t think that the alternative case can be executed for useful content either (you already caught all useful cases), I.e. I believe it can be deleted. If I’m not mistaken, the unit tests should still pass.

On a formatting note, there are a couple of spaces missing after commas (e.g. it should be shape=(0, 0, 0) in the previous post; same for some other lines you edited). It’s minor, but the code should remain in compliance with PEP 8.

PS: ad ndarrays ... again it’s minor but it should be included for consistency.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes the unit tests pass.
sorry for the formatting though.

Copy link
Member

@bryanwweber bryanwweber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @sin-ha! Thanks for taking this on. I've got several comments below, I think we need to change how this is implemented. Let me know your thoughts, or if you have any questions/concerns!

interfaces/cython/cantera/composite.py Outdated Show resolved Hide resolved
interfaces/cython/cantera/composite.py Outdated Show resolved Hide resolved
interfaces/cython/cantera/composite.py Outdated Show resolved Hide resolved
interfaces/cython/cantera/composite.py Outdated Show resolved Hide resolved
@@ -590,7 +596,7 @@ def append(self, state=None, **kwargs):
raise IndexError("Can only append to 1D SolutionArray")

for name, value in self._extra.items():
value.append(kwargs.pop(name))
np.append(value,kwargs.pop(name))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
np.append(value,kwargs.pop(name))
np.append(value, kwargs.pop(name))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this isn't correct, np.append does not work in-place. See the documentation: https://numpy.org/doc/1.18/reference/generated/numpy.append.html This will most likely be very slow, since NumPy allocates a new array every time np.append() is called. I wonder if it would be better to cast the values in _extra to lists and use the list append method, and then cast them back to arrays? Maybe the complexity isn't worth it though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also can't find any other way around see

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean, any other way around? You'll have to find some way, since this won't work as expected. I think the simplest method is just to reassign value: value = np.append(value, kwargs.pop(name)) That might be slow, since NumPy has to create a new array every time, but hopefully not too slow.

interfaces/cython/cantera/composite.py Outdated Show resolved Hide resolved
@bryanwweber
Copy link
Member

I forgot to review the test file as well! Sorry, my main comment there was that I think each time you have a comment that delimits a different set of tests, that should be a new test function.

@bryanwweber
Copy link
Member

bryanwweber commented Jun 29, 2020

@sin-ha Are you going to be able to address my comments? Thanks 😄

@sin-ha
Copy link
Contributor Author

sin-ha commented Jun 29, 2020

Yes @bryanwweber, actually i was a bit busy with my never ending university exams and forgot about it
Sorry for the late reply. I will look into everything asap.

@bryanwweber
Copy link
Member

Thanks @sin-ha I certainly understand having lots of exams 😊

@speth speth changed the base branch from master to main June 30, 2020 23:14
@sin-ha sin-ha force-pushed the Fix#829 branch 3 times, most recently from edebe9b to d9ec808 Compare July 2, 2020 12:08
Copy link
Member

@bryanwweber bryanwweber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes @sin-ha, just a few things left!

interfaces/cython/cantera/composite.py Outdated Show resolved Hide resolved
interfaces/cython/cantera/composite.py Show resolved Hide resolved
interfaces/cython/cantera/composite.py Outdated Show resolved Hide resolved
interfaces/cython/cantera/composite.py Outdated Show resolved Hide resolved
interfaces/cython/cantera/composite.py Outdated Show resolved Hide resolved
@sin-ha
Copy link
Contributor Author

sin-ha commented Jul 2, 2020

done @bryanwweber .
Meanwhile, may I also add my name to Cantera/AUTHORS ?. I don't know if I qualify though 😅

Copy link
Member

@bryanwweber bryanwweber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sin-ha. One change still to do, and please feel free to add yourself to AUTHORS!

@@ -608,7 +623,7 @@ def append(self, state=None, **kwargs):
raise IndexError("Can only append to 1D SolutionArray")

for name, value in self._extra.items():
value.append(kwargs.pop(name))
np.append(value, kwargs.pop(name))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line still isn't fixed.

Copy link
Contributor Author

@sin-ha sin-ha Jul 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't think this through before, Anyways

I think the simplest method is just to reassign value: `value = np.append(value, kwargs.pop(name))

Appending np.ndarrays is not as fast

Turns out converting them to lists and appending (as you said before). does a better job!!,

see https://stackoverflow.com/a/22394181 and https://stackoverflow.com/questions/7133885/fastest-way-to-grow-a-numpy-numeric-array

In [2]: %%timeit 
   ...: a = np.empty((0,3), int) 
   ...: l = list(a) 
   ...: for i in range(1000): 
   ...:     l.append([3*i+1,3*i+2,3*i+3]) 
   ...: l = np.array(l)                                                                                                                                                                                                   
467 µs ± 6.32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [3]: %%timeit 
   ...: a = np.array([3,4,3]) 
   ...: for i in range(1000): 
   ...:     l = list(a) 
   ...:     l.append([3*i+1,3*i+2,3*i+3]) 
   ...:     l = np.array(l) 
   ...:                                                                                                                                                                                                    
2.27 ms ± 531 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [4]: %%timeit 
   ...: a = np.empty((0,3), int) 
   ...: for i in range(1000): 
   ...:     a = np.append(a, 3*i+np.array([[1,2,3]]), 0) 
   ...:                                                                                                                                                                                                    
3.88 ms ± 21.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@bryanwweber
Copy link
Member

@sin-ha I pushed a few more commits to this branch to fix a few edge cases I found in some testing. For your reference, the changes are in commits b86786b, 1cc53b4, and 8786b51. The other commits fix a few other things I found while working on this code.

The main things to watch out for are:

  1. Make sure you test all the branches of an if/else conditional. There was a ValueError if you tried to create an extra item from only a single value (for example, extra={"prop1": 1}) due to the way you were using .reshape().
  2. Similarly, if you make a change, make sure to add a test to make sure that it works. There was no test of the append() for extra items. To be fair, there wasn't one before either... Adding a test caused me to add the fixes in 5524e99, 477cbed, and ef1fa99
  3. It's helpful to name test functions descriptively. You never actually have to call them anywhere, so using up the entire line with the function name isn't really a problem (sorry, I should have pointed this out before).

@bryanwweber
Copy link
Member

The test failures are due to using NumPy 1.11. The docs for .full() are: https://docs.scipy.org/doc/numpy-1.11.0/reference/generated/numpy.full.html#numpy.full

The default dtype for .full() in 1.11 is float, which means string values can't be included in extra arrays without specifying a different dtype on creating. NumPy 1.12 and up use np.array(fill_value).dtype as the default, so don't have this problem. I guess if we want to support NumPy 1.11 (which is the version available in the Ubuntu 16.04 repositories), we'll have to specify the dtype for all the extra arrays...

@ischoegl

This comment has been minimized.

@bryanwweber

This comment has been minimized.

@ischoegl

This comment has been minimized.

@ischoegl
Copy link
Member

ischoegl commented Jul 9, 2020

The test failures are due to using NumPy 1.11. The docs for .full() are: https://docs.scipy.org/doc/numpy-1.11.0/reference/generated/numpy.full.html#numpy.full

The default dtype for .full() in 1.11 is float, which means string values can't be included in extra arrays without specifying a different dtype on creating. NumPy 1.12 and up use np.array(fill_value).dtype as the default, so don't have this problem. I guess if we want to support NumPy 1.11 (which is the version available in the Ubuntu 16.04 repositories), we'll have to specify the dtype for all the extra arrays...

Ubuntu 16.04 support ends April 2021, which is less than a year out. Citing a conservative (albeit current) distro, CentOS8 uses numpy 1.14. GitHub's CI still uses 18.04 as ubuntu-latest, but it likely won't be long before it is upgraded to 20.04. I guess it boils down to a similar question as #891, where it's only a system-installed version that will cause issues?

PS: It appears that numpy requirements are actually not strictly enforced in SConstruct, and there’s only a warning issued?

@bryanwweber
Copy link
Member

Thanks for the thoughts on NumPy @ischoegl. It is really good to know that a conservative distribution like CentOS includes 1.14. Until this PR, there has not been a required minimum version of NumPy, we have been able to be flexible since we hadn't used very many of the user-interface features (basically, we just used the C headers to provide arrays in Cython). Hence, it was only necessary to provide a warning for versions that we hadn't tried of NumPy. Going forward, it seems it may be necessary to enforce an actual minimum version.

@ischoegl
Copy link
Member

ischoegl commented Jul 10, 2020

ad numpy ... Ubuntu 18.04 is still on 1.13.3, so this still puts some restrictions on new features, e.g. #896. I ended up testing for distutils.version.LooseVersion in #900 ...

bryanwweber added a commit to bryanwweber/cantera that referenced this pull request Jul 15, 2020
The minimum NumPy version is 1.12 to support the use of numpy.full() in
the SolutionArray interface introduced in Cantera#838.
bryanwweber added a commit to bryanwweber/cantera that referenced this pull request Jul 23, 2020
The minimum NumPy version is 1.12 to support the use of numpy.full() in
the SolutionArray interface introduced in Cantera#838.
speth pushed a commit that referenced this pull request Jul 27, 2020
The minimum NumPy version is 1.12 to support the use of numpy.full() in
the SolutionArray interface introduced in #838.
Copy link
Member

@speth speth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From reading the discussion, I think all of the previously-mentioned issues have been hashed out at this point, right? I just had one case where I think the behavior should change.

raise ValueError("Initial values for extra properties must be "
"supplied in a dict if the SolutionArray is not "
"initially empty")
self._extra[name] = np.empty(self._shape)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't the 'extra' arrays should be allowed to contain uninitialized values. The existing behavior of requiring initial values if the SolutionArray is not empty should be preserved.

Copy link
Member

@ischoegl ischoegl Aug 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@speth ... I’d ask to consider retaining this behavior (while potentially avoiding np.empty).

Think of a scenario where extra is calculated based on the content of the SolutionArray itself (which typically receives the actual value using some setter only after it is instantiated, so the content of extra is not known a priori). For this case, you’d have to provide dummy values to create extra columns, which would be subsequently overwritten. I believe that automatically initializing values as zeros would be a lot more convenient (and avoid np.empty); columns that contain strings are presumably rare, so having to provide a value there would be ok.

Copy link
Member

@bryanwweber bryanwweber Aug 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ischoegl I originally had the same objection as you, but I considered it for a while and I ended up agreeing with @speth. There are two main reasons for this, as I see it:

  1. It is trivially easy to specify a dummy value, since you do not need to specify the shape. For instance:

    >>> arr = ct.SolutionArray(gas, (3, 10), extra={"prop": 0, "prop2": 0.0})
    >>> arr.prop
    array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
    >>> arr.prop2
    array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
           [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
           [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
    >>> # This is not allowed right now, but would be almost the same, except the dtype
    >>> arr = ct.SolutionArray(gas, (3, 10), extra=["prop", "prop2"])

    Since I've used np.full() to fill the initial value, there's no need to give an array as the initial value.

  2. It isn't clear what dtype should be used - it is not only strings, but whether one wants to use floats vs. ints or any other dtype as well. I suppose floats could be used in place of ints, but there are legitimate use cases for the int dtype, for instance, as a categorical or level variable. I think it would also not be clear why one would have to specify an initial value for some dtypes but not for others.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bryanwweber ... I didn’t think of the scalar route, so your argument is convincing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The availability of the scalar initializer is a nice compromise for the case where the value is going to be calculated anyway.

If I'm reading this block of code correctly now, the only time that this line can be called is if self._shape is (0,), right? It might be more clear if the value was just set to self._extra[name] = np.empty(0) then.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's correct, it only runs if self._shape == (0,). I'm not sure it's clearer to just pass 0 though. Shapes of NumPy arrays are always returned as tuples, and just writing np.empty(0) implies to me that 0 will be the value in the array (notwithstanding the name empty), not that there's a shape of 0. So I think it would be better to use np.empty(shape=(0,)), which is then just self._shape, and putting in (0,) I thought would be more confusing than using self._shape, because the question arises, if these are equal why not just use the existing variable name?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think explicitly showing the size as zero (with whatever notation you like) is preferable because it makes it clear that this doesn't result in an array with uninitialized values. Tracing the logic back to see that self._shape is (0,) if we're in this branch is not really that simple.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, this is fixed now with shape=(0,) 👍

sin-ha and others added 13 commits August 21, 2020 20:12
Fix the formatting of the two KeyErrors in SolutionArray.append(). The
first one was missing the format method on the string. The second was
raising a double-KeyError due to the way the except is handled.
When appending to a SolutionArray, any extra values must be specified
as kwargs. If they aren't present, the array lengths would become out of
sync, or there would be a KeyError on the kwargs dictionary. Also adds a
test for the failure modes of SolutionArray.append().
Move append of extra values to the end of SolutionArray.append(). If the
append is done at the beginning of the function, the append can happen
even if the state is invalid. This would cause the length of the arrays
to become out of sync.
If a single value (e.g., integer or float) is passed as the value for an
extra column, the array with the single value cannot be reshaped. Use
np.full() instead. Add/rename tests for creating extra items by dicts
and iterables, along with tests of the failure conditions.
Make sure that ndarrays are one-dimensional. Move check for bare string.
Clarify error messages. Reduce indentation. Add tests for creating extra
items from bare strings and ndarrays.
This change bumps the builder that uses the system Python 2 to run SCons
and the system Python 3 for the Python interface to Ubuntu 18.04 from
16.04. The reason for this change is that Ubuntu 16.04 provides NumPy
1.11, which does not support flexible dtypes when creating arrays
using .full(). We have decided to drop support for NumPy older than 1.12
for this reason. Numpy 1.12 was released in January 2017.
If the extra columns are passed to the SolutionArray, they must either
have initial values or the SolutionArray must be empty.
@speth speth merged commit 3f455e1 into Cantera:main Sep 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Indexed assignment of extra columns in SolutionArray fails
4 participants